
The AdaBoost algorithm

0) Set $W_i^{(0)} = 1/n$ for $i = 1, \ldots, n$.

1) At the $m$-th iteration we find (any) classifier $h(x; \hat\theta_m)$ for which the weighted classification error $\epsilon_m$,
\[
\epsilon_m = 0.5 - \frac{1}{2}\sum_{i=1}^{n} W_i^{(m-1)}\, y_i\, h(x_i; \hat\theta_m),
\]
is better than chance.

2) The new component is assigned votes based on its error: $\hat\alpha_m = 0.5\,\log\bigl((1-\epsilon_m)/\epsilon_m\bigr)$.

3) The weights are updated according to ($Z_m$ is chosen so that the new weights sum to one):
\[
W_i^{(m)} = \frac{1}{Z_m}\, W_i^{(m-1)} \exp\{-y_i\, \hat\alpha_m\, h(x_i; \hat\theta_m)\}
\]

Tommi Jaakkola, MIT CSAIL 18
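
As a rough sketch of these three steps in code (not part of the lecture): the decision-stump weak learner, the function names, and the small numeric safeguards below are assumptions made for illustration, with labels taken to be in {-1, +1} as on the slide.

```python
import numpy as np

def train_stump(X, y, w):
    """Pick the axis-aligned threshold stump with the smallest weighted error.
    The stump plays the role of h(x; theta_m); any weak learner would do."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best

def stump_predict(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] > thr, 1, -1)

def adaboost(X, y, rounds=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # 0) W_i^(0) = 1/n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)           # 1) weak classifier h(x; theta_m)
        pred = stump_predict(stump, X)
        eps = w[pred != y].sum()               # weighted classification error
        if eps >= 0.5:                         # no longer better than chance: stop
            break
        eps = np.clip(eps, 1e-12, None)        # guard against log(1/0) when eps == 0
        alpha = 0.5 * np.log((1 - eps) / eps)  # 2) votes based on the error
        w = w * np.exp(-y * alpha * pred)      # 3) reweight the examples
        w /= w.sum()                           # dividing by Z_m: weights sum to one
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble))
```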

Adaboost properties: exponential loss

After each boosting iteration, assuming we can find a component classifier whose weighted error is better than chance, the combined classifier
\[
\hat h_m(x) = \hat\alpha_1 h(x; \hat\theta_1) + \ldots + \hat\alpha_m h(x; \hat\theta_m)
\]
is guaranteed to have a lower exponential loss over the training examples.

[Figure: exponential loss on the training set as a function of the number of iterations]

Tommi Jaakkola, MIT CSAIL 2
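
The guarantee can be made quantitative; the following identity is not on the slide, but it follows directly from unrolling the weight update defined earlier:
\[
\sum_{i=1}^{n} \exp\{-y_i\, \hat h_m(x_i)\} \;=\; n \prod_{k=1}^{m} Z_k,
\qquad
Z_k = 2\sqrt{\epsilon_k (1 - \epsilon_k)} < 1 \ \text{ whenever } \epsilon_k < 0.5,
\]
so each boosting iteration multiplies the training exponential loss by a factor strictly smaller than one.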

Adaboost properties: training error

The boosting iterations also decrease the classification error of the combined classifier
\[
\hat h_m(x) = \hat\alpha_1 h(x; \hat\theta_1) + \ldots + \hat\alpha_m h(x; \hat\theta_m)
\]
over the training examples.

[Figure: training error as a function of the number of iterations]

Tommi Jaakkola, MIT CSAIL 21

Adaboost properties: training error cont'd

The training classification error has to go down exponentially fast if the weighted errors of the component classifiers, $\epsilon_k$, are strictly better than chance, $\epsilon_k < 0.5$:
\[
\mathrm{err}(\hat h_m) \;\le\; \prod_{k=1}^{m} 2\sqrt{\epsilon_k (1 - \epsilon_k)}
\]

[Figure: training error as a function of the number of iterations]

Tommi Jaakkola, MIT CSAIL 22
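
To see why this bound decays exponentially (a standard step the slide leaves implicit), write each weighted error as $\epsilon_k = 0.5 - \gamma_k$ with edge $\gamma_k > 0$:
\[
2\sqrt{\epsilon_k(1-\epsilon_k)} = \sqrt{1 - 4\gamma_k^2} \le e^{-2\gamma_k^2}
\quad\Longrightarrow\quad
\mathrm{err}(\hat h_m) \le \exp\Bigl(-2\sum_{k=1}^{m}\gamma_k^2\Bigr),
\]
so a uniform edge $\gamma_k \ge \gamma$ drives the training error to zero at rate $e^{-2\gamma^2 m}$.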

Adaboost properties: weighted error

The weighted error of each new component classifier,
\[
\epsilon_k = 0.5 - \frac{1}{2}\sum_{i=1}^{n} W_i^{(k-1)}\, y_i\, h(x_i; \hat\theta_k),
\]
tends to increase as a function of boosting iterations.

[Figure: weighted training error as a function of the number of iterations]

Tommi Jaakkola, MIT CSAIL 23
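
One way to see why (not spelled out on the slide): immediately after the update, the new weights are constructed so that the component just added sits exactly at chance level,
\[
\sum_{i=1}^{n} W_i^{(m)}\, y_i\, h(x_i; \hat\theta_m) = 0,
\]
so each new component must find structure that the reweighted data no longer credits to its predecessors, and its weighted error is pushed toward 0.5.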

How Will Test Error Behave? (A First Guess)

[Figure: expected training and test error as a function of the number of rounds T]

expect:
- training error to continue to drop (or reach zero)
- test error to increase when H_final becomes too complex
  - "Occam's razor"
  - overfitting
  - hard to know when to stop training

Technically...

with high probability:
\[
\text{generalization error} \;\le\; \text{training error} + \tilde O\!\left(\sqrt{\frac{dT}{m}}\right)
\]

bound depends on
- m = # training examples
- d = "complexity" of weak classifiers
- T = # rounds

(generalization error = E[test error])

predicts overfitting

Typical performance

Training and test errors of the combined classifier
\[
\hat h_m(x) = \hat\alpha_1 h(x; \hat\theta_1) + \ldots + \hat\alpha_m h(x; \hat\theta_m)
\]

[Figure: training and test errors as a function of the number of iterations]

Why should the test error go down after we already have zero training error?

Tommi Jaakkola, MIT CSAIL 24

AdaBoost and margin

We can write the combined classifier in a more useful form by dividing the predictions by the total number of "votes":
\[
\hat h_m(x) = \frac{\hat\alpha_1 h(x; \hat\theta_1) + \ldots + \hat\alpha_m h(x; \hat\theta_m)}{\hat\alpha_1 + \ldots + \hat\alpha_m}
\]
This allows us to define a clear notion of the voting margin that the combined classifier achieves for each training example:
\[
\mathrm{margin}(x_i) = y_i\, \hat h_m(x_i)
\]
The margin lies in $[-1, 1]$ and is negative for all misclassified examples.

Tommi Jaakkola, MIT CSAIL 25
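
Continuing the earlier Python sketch (the `ensemble` and `stump_predict` names come from that sketch, not from the lecture), the normalized voting margins could be computed as:

```python
import numpy as np

def margins(ensemble, X, y):
    """Voting margin y_i * h_hat_m(x_i), with the votes normalized to sum to one."""
    total_votes = sum(alpha for alpha, _ in ensemble)
    score = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return y * score / total_votes   # lies in [-1, 1]; negative iff misclassified
```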

AdaBoost and margin

Successive boosting iterations still improve the majority vote or margin for the training examples:
\[
\mathrm{margin}(x_i) = y_i\, \frac{\hat\alpha_1 h(x_i; \hat\theta_1) + \ldots + \hat\alpha_m h(x_i; \hat\theta_m)}{\hat\alpha_1 + \ldots + \hat\alpha_m}
\]

[Figure: cumulative distributions of margin values, shown after increasing numbers of boosting iterations]

Tommi Jaakkola, MIT CSAIL 26, 27

Can we improve the combination?

As a result of running the boosting algorithm for m iterations, we essentially generate a new feature representation for the data:
\[
\phi_i(x) = h(x; \hat\theta_i), \quad i = 1, \ldots, m
\]
Perhaps we can do better by separately estimating a new set of votes for each component. In other words, we could estimate a linear classifier of the form
\[
f(x; \alpha) = \alpha_1 \phi_1(x) + \ldots + \alpha_m \phi_m(x)
\]
where each parameter $\alpha_i$ can now be any real number (even negative). The parameters would be estimated jointly rather than one after the other as in boosting.

Tommi Jaakkola, MIT CSAIL 28

Can we improve the combination?

We could use SVMs in a postprocessing step to re-optimize
\[
f(x; \alpha) = \alpha_1 \phi_1(x) + \ldots + \alpha_m \phi_m(x)
\]
with respect to $\alpha_1, \ldots, \alpha_m$. This is not necessarily a good idea.

[Figure: typical training/test errors, boosting (vs. number of iterations) and SVM postprocessing (vs. number of components)]

Tommi Jaakkola, MIT CSAIL 29
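
As a sketch of this postprocessing step (again reusing the `ensemble` and `stump_predict` names from the earlier sketch, with scikit-learn's linear SVM standing in for the lecture's generic SVM):

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_postprocess(ensemble, X, y, C=1.0):
    """Re-estimate the votes alpha_1..alpha_m jointly: treat each component's
    predictions as a feature phi_k(x) and fit a linear SVM over those features."""
    Phi = np.column_stack([stump_predict(stump, X) for _, stump in ensemble])
    svm = LinearSVC(C=C).fit(Phi, y)
    return svm.coef_.ravel()   # new votes, one per component, possibly negative
```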

Practical Advantages of AdaBoost

- fast
- simple and easy to program
- no parameters to tune (except T)
- flexible: can combine with any learning algorithm
- no prior knowledge needed about weak learner
- provably effective, provided can consistently find rough rules of thumb
  (shift in mind set: goal now is merely to find classifiers barely better than random guessing)
- versatile
  - can use with data that is textual, numeric, discrete, etc.
  - has been extended to learning problems well beyond binary classification

Caveats

- performance of AdaBoost depends on data and weak learner
- consistent with theory, AdaBoost can fail if
  - weak classifiers too complex: overfitting
  - weak classifiers too weak ($\gamma_t \to 0$ too quickly): underfitting, low margins, overfitting
- empirically, AdaBoost seems especially susceptible to uniform noise

Multiclass Problems

say $y \in Y$ where $|Y| = k$

direct approach (AdaBoost.M1): [with Freund]
\[
h_t : X \to Y, \qquad
D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times
\begin{cases}
e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\
e^{\alpha_t} & \text{if } y_i \ne h_t(x_i)
\end{cases}
\qquad
H_{\mathrm{final}}(x) = \arg\max_{y \in Y} \sum_{t :\, h_t(x) = y} \alpha_t
\]

can prove same bound on error if $\forall t:\ \epsilon_t \le 1/2$ (a strong requirement for $k > 2$, since random guessing already has error $1 - 1/k > 1/2$)
- in practice, not usually a problem for "strong" weak learners (e.g., C4.5)
- significant problem for "weak" weak learners (e.g., decision stumps)

instead, reduce to binary

The One-Against-All Approach

break k-class problem into k binary problems and solve each separately

say possible labels are Y = { , , , } (four classes)

[Table: each training example x_1, ..., x_5 is relabeled in each of the four binary problems, getting + in the problem for its own class and - in the others]

to classify a new example, choose the label predicted to be most "positive"

= AdaBoost.MH [with Singer]

problem: not robust to errors in predictions
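
A minimal sketch of this reduction using scikit-learn (the library, dataset, and parameter choices are illustrative assumptions; the slide itself is library-agnostic):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# k = 10 classes; OneVsRestClassifier trains one binary AdaBoost per class and,
# for a new example, picks the label whose binary classifier scores it most positively.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ova = OneVsRestClassifier(AdaBoostClassifier(n_estimators=200))
ova.fit(X_tr, y_tr)
print("one-vs-all test accuracy:", ova.score(X_te, y_te))
```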